Note: You can use this file as your ‘working document’ to try out investigation ideas and keep notes about your findings. How you use and structure this file is up to you. Keeping notes about what you are investigating and what you find will make it easier to create your presentation and report. Please note that you do not need to submit this file as part of your group project.

options(repos = c(CRAN = "https://cloud.r-project.org"))
library(tidyverse)
# Add any other libraries here
# (dplyr and tidyr are already attached by tidyverse; reloading them is harmless)
library(dplyr)
library(tidyr)

Analysis Steps

1. Load the Dataset: imported from the CSV file; summary statistics reviewed.
2. Data Cleaning: handled missing values; removed features with excessive missing data; standardized data for regression.
3. Exploratory Data Analysis (EDA): visualized relationships between features and crime rates; identified patterns and key features.
4. Regression Modeling:
   - Linear Regression: modeled continuous crime rates based on features.
   - Logistic Regression: converted the crime rate into a binary classification (e.g., high/low).
5. Evaluation: analyzed model performance metrics (e.g., accuracy, R-squared, confusion matrix).

Classic usage scenarios: In crime analysis and prediction, the classic application of the Communities and Crime dataset is predicting community crime rates with regression models.

Researchers use the socio-economic, law-enforcement, and demographic data in this dataset to identify the key factors influencing the crime rate and to make predictions. This analysis not only helps explain the drivers of crime rates but also gives policymakers data to support more effective prevention strategies.

Solving academic problems: The Communities and Crime dataset addresses a fundamental issue in criminology research: how to quantify and predict community crime rates. By integrating multi-dimensional socio-economic and demographic data, it gives the academic community a rich resource for exploring the complex relationships between crime rates and social factors. This drives research on crime prediction models and offers a scientific basis for decision-making in social policy and public safety.

Practical applications: In practice, the Communities and Crime dataset is widely used in urban planning and public safety management. For instance, local governments and law enforcement agencies can use its analysis results to optimize resource allocation and enhance community security. Non-profit organizations and community groups can also use the data to design targeted social intervention programs that reduce crime rates and improve the community environment.

What to Add, and Why It Improves Your Project:
1️⃣ Correlation heatmap: visualizes variable relationships clearly.
2️⃣ Distribution plots: show how the target and predictors vary.
3️⃣ Variable selection (feature reduction): simplifies the model and improves interpretability.
4️⃣ Cross-validation: makes model evaluation more robust.
5️⃣ Residual analysis: checks model errors for bias.
6️⃣ Map or geographic visualization (optional): shows crime rates per area (if state/county info is available).
7️⃣ Model comparison plot: visually compares the models.
8️⃣ Deeper ethics discussion: shows awareness of real-world implications.

Ethical Considerations
The Communities and Crime dataset contains sensitive attributes such as race, income, and family structure.
These variables can reflect systemic inequalities and should be treated carefully:
- Avoid using models like these for individual prediction or policing.
- Be aware of bias amplification: if biased data (e.g., due to unequal policing) is used to train a model, its predictions will also be biased.
- Focus on understanding community-level patterns and informing equitable policy decisions rather than punitive actions.

Ethical data science means asking not just “Can we predict it?” but also “Should we?”.

library(tidyverse)
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Load column names
names_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names"
name_lines <- readLines(names_url)
name_lines <- name_lines[grepl("^@attribute", name_lines)]
col_names <- gsub("^@attribute\\s+", "", name_lines)
col_names <- sub("\\s+.*$", "", col_names)
col_names <- col_names[col_names != ""]

# Load dataset
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data"
crime <- read.csv(data_url, header = FALSE, na.strings = "?", col.names = col_names)

# Clean dataset
crime <- crime %>%
  select(-state, -county, -community, -communityname, -fold)
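
The step list above mentions removing features with excessive missing data, but the imputation chunk below fills every column instead. A quick per-column missingness check (a sketch; the 0.5 cutoff is an illustrative choice, and the drop line is left commented out so it does not change the results below) would support dropping the mostly-missing LEMAS policing variables rather than imputing them:

```r
# Fraction of missing values per column, worst first
na_frac <- colMeans(is.na(crime))
head(sort(na_frac, decreasing = TRUE), 10)

# Optionally drop columns that are mostly missing (cutoff is an assumption)
# crime <- crime[, na_frac <= 0.5]
```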

# Replace missing values with column means
# (the LEMAS policing variables are missing for most communities, so mean
#  imputation leaves them nearly constant; see the identical quartiles below)
data_clean <- crime %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Summary statistics of the cleaned data
summary(data_clean)
##    population      householdsize     racepctblack     racePctWhite   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.01000   1st Qu.:0.3500   1st Qu.:0.0200   1st Qu.:0.6300  
##  Median :0.02000   Median :0.4400   Median :0.0600   Median :0.8500  
##  Mean   :0.05759   Mean   :0.4634   Mean   :0.1796   Mean   :0.7537  
##  3rd Qu.:0.05000   3rd Qu.:0.5400   3rd Qu.:0.2300   3rd Qu.:0.9400  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   racePctAsian     racePctHisp     agePct12t21      agePct12t29    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0400   1st Qu.:0.010   1st Qu.:0.3400   1st Qu.:0.4100  
##  Median :0.0700   Median :0.040   Median :0.4000   Median :0.4800  
##  Mean   :0.1537   Mean   :0.144   Mean   :0.4242   Mean   :0.4939  
##  3rd Qu.:0.1700   3rd Qu.:0.160   3rd Qu.:0.4700   3rd Qu.:0.5400  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##   agePct16t24       agePct65up       numbUrban          pctUrban     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.2500   1st Qu.:0.3000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.2900   Median :0.4200   Median :0.03000   Median :1.0000  
##  Mean   :0.3363   Mean   :0.4232   Mean   :0.06407   Mean   :0.6963  
##  3rd Qu.:0.3600   3rd Qu.:0.5300   3rd Qu.:0.07000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##    medIncome         pctWWage       pctWFarmSelf      pctWInvInc    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2000   1st Qu.:0.4400   1st Qu.:0.1600   1st Qu.:0.3700  
##  Median :0.3200   Median :0.5600   Median :0.2300   Median :0.4800  
##  Mean   :0.3611   Mean   :0.5582   Mean   :0.2916   Mean   :0.4957  
##  3rd Qu.:0.4900   3rd Qu.:0.6900   3rd Qu.:0.3700   3rd Qu.:0.6200  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    pctWSocSec      pctWPubAsst       pctWRetire       medFamInc     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3500   1st Qu.:0.1425   1st Qu.:0.3600   1st Qu.:0.2300  
##  Median :0.4750   Median :0.2600   Median :0.4700   Median :0.3300  
##  Mean   :0.4711   Mean   :0.3178   Mean   :0.4792   Mean   :0.3757  
##  3rd Qu.:0.5800   3rd Qu.:0.4400   3rd Qu.:0.5800   3rd Qu.:0.4800  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    perCapInc       whitePerCap     blackPerCap      indianPerCap   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2200   1st Qu.:0.240   1st Qu.:0.1725   1st Qu.:0.1100  
##  Median :0.3000   Median :0.320   Median :0.2500   Median :0.1700  
##  Mean   :0.3503   Mean   :0.368   Mean   :0.2911   Mean   :0.2035  
##  3rd Qu.:0.4300   3rd Qu.:0.440   3rd Qu.:0.3800   3rd Qu.:0.2500  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##   AsianPerCap      OtherPerCap       HispPerCap      NumUnderPov     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.1900   1st Qu.:0.1700   1st Qu.:0.2600   1st Qu.:0.01000  
##  Median :0.2800   Median :0.2500   Median :0.3450   Median :0.02000  
##  Mean   :0.3224   Mean   :0.2847   Mean   :0.3863   Mean   :0.05551  
##  3rd Qu.:0.4000   3rd Qu.:0.3600   3rd Qu.:0.4800   3rd Qu.:0.05000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  PctPopUnderPov  PctLess9thGrade   PctNotHSGrad     PctBSorMore    
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.110   1st Qu.:0.1600   1st Qu.:0.2300   1st Qu.:0.2100  
##  Median :0.250   Median :0.2700   Median :0.3600   Median :0.3100  
##  Mean   :0.303   Mean   :0.3158   Mean   :0.3833   Mean   :0.3617  
##  3rd Qu.:0.450   3rd Qu.:0.4200   3rd Qu.:0.5100   3rd Qu.:0.4600  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  PctUnemployed      PctEmploy       PctEmplManu     PctEmplProfServ 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2200   1st Qu.:0.3800   1st Qu.:0.2500   1st Qu.:0.3200  
##  Median :0.3200   Median :0.5100   Median :0.3700   Median :0.4100  
##  Mean   :0.3635   Mean   :0.5011   Mean   :0.3964   Mean   :0.4406  
##  3rd Qu.:0.4800   3rd Qu.:0.6275   3rd Qu.:0.5200   3rd Qu.:0.5300  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   PctOccupManu    PctOccupMgmtProf MalePctDivorce   MalePctNevMarr  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2400   1st Qu.:0.3100   1st Qu.:0.3300   1st Qu.:0.3100  
##  Median :0.3700   Median :0.4000   Median :0.4700   Median :0.4000  
##  Mean   :0.3912   Mean   :0.4413   Mean   :0.4612   Mean   :0.4345  
##  3rd Qu.:0.5100   3rd Qu.:0.5400   3rd Qu.:0.5900   3rd Qu.:0.5000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   FemalePctDiv     TotalPctDiv       PersPerFam       PctFam2Par    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3600   1st Qu.:0.3600   1st Qu.:0.4000   1st Qu.:0.4900  
##  Median :0.5000   Median :0.5000   Median :0.4700   Median :0.6300  
##  Mean   :0.4876   Mean   :0.4943   Mean   :0.4877   Mean   :0.6109  
##  3rd Qu.:0.6200   3rd Qu.:0.6300   3rd Qu.:0.5600   3rd Qu.:0.7600  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   PctKids2Par     PctYoungKids2Par  PctTeen2Par     PctWorkMomYoungKids
##  Min.   :0.0000   Min.   :0.000    Min.   :0.0000   Min.   :0.0000     
##  1st Qu.:0.4900   1st Qu.:0.530    1st Qu.:0.4800   1st Qu.:0.3900     
##  Median :0.6400   Median :0.700    Median :0.6100   Median :0.5100     
##  Mean   :0.6207   Mean   :0.664    Mean   :0.5829   Mean   :0.5014     
##  3rd Qu.:0.7800   3rd Qu.:0.840    3rd Qu.:0.7200   3rd Qu.:0.6200     
##  Max.   :1.0000   Max.   :1.000    Max.   :1.0000   Max.   :1.0000     
##    PctWorkMom        NumIlleg          PctIlleg       NumImmig      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00   Min.   :0.00000  
##  1st Qu.:0.4200   1st Qu.:0.00000   1st Qu.:0.09   1st Qu.:0.00000  
##  Median :0.5400   Median :0.01000   Median :0.17   Median :0.01000  
##  Mean   :0.5267   Mean   :0.03629   Mean   :0.25   Mean   :0.03006  
##  3rd Qu.:0.6500   3rd Qu.:0.02000   3rd Qu.:0.32   3rd Qu.:0.02000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00   Max.   :1.00000  
##  PctImmigRecent    PctImmigRec5     PctImmigRec8    PctImmigRec10   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1600   1st Qu.:0.2000   1st Qu.:0.2500   1st Qu.:0.2800  
##  Median :0.2900   Median :0.3400   Median :0.3900   Median :0.4300  
##  Mean   :0.3202   Mean   :0.3606   Mean   :0.3991   Mean   :0.4279  
##  3rd Qu.:0.4300   3rd Qu.:0.4800   3rd Qu.:0.5300   3rd Qu.:0.5600  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  PctRecentImmig    PctRecImmig5     PctRecImmig8    PctRecImmig10   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0300   1st Qu.:0.0300   1st Qu.:0.0300   1st Qu.:0.0300  
##  Median :0.0900   Median :0.0800   Median :0.0900   Median :0.0900  
##  Mean   :0.1814   Mean   :0.1821   Mean   :0.1848   Mean   :0.1829  
##  3rd Qu.:0.2300   3rd Qu.:0.2300   3rd Qu.:0.2300   3rd Qu.:0.2300  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam  PctLargHouseOccup
##  Min.   :0.0000   Min.   :0.0000      Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.7300   1st Qu.:0.0300      1st Qu.:0.1500   1st Qu.:0.1400   
##  Median :0.8700   Median :0.0600      Median :0.2000   Median :0.1900   
##  Mean   :0.7859   Mean   :0.1506      Mean   :0.2676   Mean   :0.2519   
##  3rd Qu.:0.9400   3rd Qu.:0.1600      3rd Qu.:0.3100   3rd Qu.:0.2900   
##  Max.   :1.0000   Max.   :1.0000      Max.   :1.0000   Max.   :1.0000   
##  PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup 
##  Min.   :0.0000   Min.   :0.0000    Min.   :0.0000     Min.   :0.0000  
##  1st Qu.:0.3400   1st Qu.:0.3900    1st Qu.:0.2700     1st Qu.:0.4400  
##  Median :0.4400   Median :0.4800    Median :0.3600     Median :0.5600  
##  Mean   :0.4621   Mean   :0.4944    Mean   :0.4041     Mean   :0.5626  
##  3rd Qu.:0.5500   3rd Qu.:0.5800    3rd Qu.:0.4900     3rd Qu.:0.7000  
##  Max.   :1.0000   Max.   :1.0000    Max.   :1.0000     Max.   :1.0000  
##  PctPersDenseHous PctHousLess3BR      MedNumBR        HousVacant     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0600   1st Qu.:0.4000   1st Qu.:0.0000   1st Qu.:0.01000  
##  Median :0.1100   Median :0.5100   Median :0.5000   Median :0.03000  
##  Mean   :0.1863   Mean   :0.4952   Mean   :0.3147   Mean   :0.07682  
##  3rd Qu.:0.2200   3rd Qu.:0.6000   3rd Qu.:0.5000   3rd Qu.:0.07000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##   PctHousOccup    PctHousOwnOcc    PctVacantBoarded PctVacMore6Mos  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.6300   1st Qu.:0.4300   1st Qu.:0.0600   1st Qu.:0.2900  
##  Median :0.7700   Median :0.5400   Median :0.1300   Median :0.4200  
##  Mean   :0.7195   Mean   :0.5487   Mean   :0.2045   Mean   :0.4333  
##  3rd Qu.:0.8600   3rd Qu.:0.6700   3rd Qu.:0.2700   3rd Qu.:0.5600  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  MedYrHousBuilt   PctHousNoPhone   PctWOFullPlumb   OwnOccLowQuart  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3500   1st Qu.:0.0600   1st Qu.:0.1000   1st Qu.:0.0900  
##  Median :0.5200   Median :0.1850   Median :0.1900   Median :0.1800  
##  Mean   :0.4942   Mean   :0.2645   Mean   :0.2431   Mean   :0.2647  
##  3rd Qu.:0.6700   3rd Qu.:0.4200   3rd Qu.:0.3300   3rd Qu.:0.4000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   OwnOccMedVal    OwnOccHiQuart       RentLowQ        RentMedian    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0900   1st Qu.:0.0900   1st Qu.:0.1700   1st Qu.:0.2000  
##  Median :0.1700   Median :0.1800   Median :0.3100   Median :0.3300  
##  Mean   :0.2635   Mean   :0.2689   Mean   :0.3464   Mean   :0.3725  
##  3rd Qu.:0.3900   3rd Qu.:0.3800   3rd Qu.:0.4900   3rd Qu.:0.5200  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    RentHighQ        MedRent       MedRentPctHousInc MedOwnCostPctInc
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.220   1st Qu.:0.2100   1st Qu.:0.3700    1st Qu.:0.3200  
##  Median :0.370   Median :0.3400   Median :0.4800    Median :0.4500  
##  Mean   :0.423   Mean   :0.3841   Mean   :0.4901    Mean   :0.4498  
##  3rd Qu.:0.590   3rd Qu.:0.5300   3rd Qu.:0.5900    3rd Qu.:0.5800  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000    Max.   :1.0000  
##  MedOwnCostPctIncNoMtg NumInShelters       NumStreet       PctForeignBorn  
##  Min.   :0.0000        Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.2500        1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0600  
##  Median :0.3700        Median :0.00000   Median :0.00000   Median :0.1300  
##  Mean   :0.4038        Mean   :0.02944   Mean   :0.02278   Mean   :0.2156  
##  3rd Qu.:0.5100        3rd Qu.:0.01000   3rd Qu.:0.00000   3rd Qu.:0.2800  
##  Max.   :1.0000        Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##  PctBornSameState PctSameHouse85   PctSameCity85    PctSameState85  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.4700   1st Qu.:0.4200   1st Qu.:0.5200   1st Qu.:0.5600  
##  Median :0.6300   Median :0.5400   Median :0.6700   Median :0.7000  
##  Mean   :0.6089   Mean   :0.5351   Mean   :0.6264   Mean   :0.6515  
##  3rd Qu.:0.7775   3rd Qu.:0.6600   3rd Qu.:0.7700   3rd Qu.:0.7900  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   LemasSwornFT     LemasSwFTPerPop  LemasSwFTFieldOps LemasSwFTFieldPerPop
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000    Min.   :0.0000      
##  1st Qu.:0.06966   1st Qu.:0.2175   1st Qu.:0.9247    1st Qu.:0.2463      
##  Median :0.06966   Median :0.2175   Median :0.9247    Median :0.2463      
##  Mean   :0.06966   Mean   :0.2175   Mean   :0.9247    Mean   :0.2463      
##  3rd Qu.:0.06966   3rd Qu.:0.2175   3rd Qu.:0.9247    3rd Qu.:0.2463      
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000    Max.   :1.0000      
##  LemasTotalReq     LemasTotReqPerPop PolicReqPerOffic  PolicPerPop    
##  Min.   :0.00000   Min.   :0.0000    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.09799   1st Qu.:0.2152    1st Qu.:0.3436   1st Qu.:0.2175  
##  Median :0.09799   Median :0.2152    Median :0.3436   Median :0.2175  
##  Mean   :0.09799   Mean   :0.2152    Mean   :0.3436   Mean   :0.2175  
##  3rd Qu.:0.09799   3rd Qu.:0.2152    3rd Qu.:0.3436   3rd Qu.:0.2175  
##  Max.   :1.00000   Max.   :1.0000    Max.   :1.0000   Max.   :1.0000  
##  RacialMatchCommPol PctPolicWhite   PctPolicBlack     PctPolicHisp   
##  Min.   :0.0000     Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.6894     1st Qu.:0.727   1st Qu.:0.2205   1st Qu.:0.1349  
##  Median :0.6894     Median :0.727   Median :0.2205   Median :0.1349  
##  Mean   :0.6894     Mean   :0.727   Mean   :0.2205   Mean   :0.1349  
##  3rd Qu.:0.6894     3rd Qu.:0.727   3rd Qu.:0.2205   3rd Qu.:0.1349  
##  Max.   :1.0000     Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##  PctPolicAsian    PctPolicMinor    OfficAssgnDrugUnits NumKindsDrugsSeiz
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000     Min.   :0.0000   
##  1st Qu.:0.1149   1st Qu.:0.2592   1st Qu.:0.07555     1st Qu.:0.5561   
##  Median :0.1149   Median :0.2592   Median :0.07555     Median :0.5561   
##  Mean   :0.1149   Mean   :0.2592   Mean   :0.07555     Mean   :0.5561   
##  3rd Qu.:0.1149   3rd Qu.:0.2592   3rd Qu.:0.07555     3rd Qu.:0.5561   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000     Max.   :1.0000   
##  PolicAveOTWorked    LandArea          PopDens       PctUsePubTrans  
##  Min.   :0.000    Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.306    1st Qu.:0.02000   1st Qu.:0.1000   1st Qu.:0.0200  
##  Median :0.306    Median :0.04000   Median :0.1700   Median :0.0700  
##  Mean   :0.306    Mean   :0.06523   Mean   :0.2329   Mean   :0.1617  
##  3rd Qu.:0.306    3rd Qu.:0.07000   3rd Qu.:0.2800   3rd Qu.:0.1900  
##  Max.   :1.000    Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##    PolicCars      PolicOperBudg     LemasPctPolicOnPatr LemasGangUnitDeploy
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      Min.   :0.0000     
##  1st Qu.:0.1631   1st Qu.:0.07671   1st Qu.:0.6986      1st Qu.:0.4404     
##  Median :0.1631   Median :0.07671   Median :0.6986      Median :0.4404     
##  Mean   :0.1631   Mean   :0.07671   Mean   :0.6986      Mean   :0.4404     
##  3rd Qu.:0.1631   3rd Qu.:0.07671   3rd Qu.:0.6986      3rd Qu.:0.4404     
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      Max.   :1.0000     
##  LemasPctOfficDrugUn PolicBudgPerPop  ViolentCrimesPerPop
##  Min.   :0.00000     Min.   :0.0000   Min.   :0.000      
##  1st Qu.:0.00000     1st Qu.:0.1951   1st Qu.:0.070      
##  Median :0.00000     Median :0.1951   Median :0.150      
##  Mean   :0.09405     Mean   :0.1951   Mean   :0.238      
##  3rd Qu.:0.00000     3rd Qu.:0.1951   3rd Qu.:0.330      
##  Max.   :1.00000     Max.   :1.0000   Max.   :1.000
cat("Data cleaned. Shape:", nrow(data_clean), "rows and", ncol(data_clean), "columns\n")
## Data cleaned. Shape: 1994 rows and 123 columns
library(corrplot)
## corrplot 0.95 loaded
# Compute correlations among numeric columns
crime_num <- data_clean %>% select(where(is.numeric))
corr_matrix <- cor(crime_num, use = "complete.obs")

# Plot the full correlation heatmap (labels shrunk to fit all variables)
corrplot(corr_matrix, type = "lower", tl.cex = 0.2, tl.col = "black")
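
The full heatmap is hard to read with over a hundred labels; a sketch that restricts it to the variables most correlated with the target makes the relationships legible (keeping the top 15 is an arbitrary choice):

```r
# Heatmap of the 15 variables most correlated (in absolute value) with the target
top_vars <- names(sort(abs(corr_matrix[, "ViolentCrimesPerPop"]),
                       decreasing = TRUE))[1:15]
corrplot(corr_matrix[top_vars, top_vars], type = "lower",
         tl.cex = 0.7, tl.col = "black")
```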

# Target Variable Distribution
ggplot(data_clean, aes(x = ViolentCrimesPerPop)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Violent Crime Rate",
       x = "Violent Crimes per Population", y = "Frequency")

ggplot(data_clean, aes(x = PctPopUnderPov, y = ViolentCrimesPerPop)) +
  geom_point(alpha = 0.5, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Violent Crime vs Poverty", x = "Poverty (%)", y = "Violent Crime Rate")
## `geom_smooth()` using formula = 'y ~ x'

library(caret)      
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest) 
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(123)
train_index <- createDataPartition(data_clean$ViolentCrimesPerPop, p = 0.8, list = FALSE)
train <- data_clean[train_index, ]
test  <- data_clean[-train_index, ]
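
Item 4️⃣ from the improvement list (cross-validation) has no code yet; a minimal sketch using caret's `train()` on the training split (5 folds is an arbitrary choice, not the project's decision):

```r
# 5-fold cross-validated linear regression (fold count is an assumption)
ctrl <- trainControl(method = "cv", number = 5)
cv_lm <- train(ViolentCrimesPerPop ~ ., data = train,
               method = "lm", trControl = ctrl)
cv_lm$results  # cross-validated RMSE, R-squared, MAE
```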

# Correlation with target
corrs <- sapply(data_clean, function(x) cor(x, data_clean$ViolentCrimesPerPop, use = "complete.obs"))
head(sort(corrs, decreasing = TRUE), 10)
## ViolentCrimesPerPop            PctIlleg        racepctblack         pctWPubAsst 
##           1.0000000           0.7379565           0.6312636           0.5746653 
##        FemalePctDiv         TotalPctDiv      MalePctDivorce      PctPopUnderPov 
##           0.5560319           0.5527774           0.5254073           0.5218765 
##       PctUnemployed      PctHousNoPhone 
##           0.5042346           0.4882435
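
The list above shows only the strongest positive correlations; the strongest negative ones (candidate protective factors such as two-parent-family percentages) can be listed the same way:

```r
# Ten most negative correlations with the target
head(sort(corrs), 10)
```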
summary(data_clean$ViolentCrimesPerPop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.070   0.150   0.238   0.330   1.000
# Check a few relationships
plot(data_clean$medIncome, data_clean$ViolentCrimesPerPop,
     main = "Crime vs Median Income",
     xlab = "Median Income", ylab = "Violent Crime Rate")

plot(data_clean$pctWWage, data_clean$ViolentCrimesPerPop,
     main = "Crime vs Wage Income",
     xlab = "Households with Wage/Salary Income (%)", ylab = "Violent Crime Rate")

# Build two models

# (a) Linear Regression (note: many predictors are highly collinear,
#     so individual coefficients should be interpreted with caution)
lm_model <- lm(ViolentCrimesPerPop ~ ., data = train)
lm_pred <- predict(lm_model, newdata = test)

# (b) Random Forest
rf_model <- randomForest(ViolentCrimesPerPop ~ ., data = train, ntree = 150)
rf_pred <- predict(rf_model, newdata = test)
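
The step list at the top also mentions logistic regression on a high/low crime split, which does not appear below; a minimal sketch, assuming a median cutoff (the threshold and the three predictors are illustrative choices, not the project's):

```r
# (c) Logistic Regression on a high/low split (median cutoff is an assumption)
cutoff <- median(train$ViolentCrimesPerPop)
train_bin <- train %>% mutate(HighCrime = as.integer(ViolentCrimesPerPop > cutoff))
test_bin  <- test  %>% mutate(HighCrime = as.integer(ViolentCrimesPerPop > cutoff))

logit_model <- glm(HighCrime ~ PctPopUnderPov + PctUnemployed + PctIlleg,
                   data = train_bin, family = binomial)

# Confusion matrix and accuracy at a 0.5 probability threshold
logit_prob <- predict(logit_model, newdata = test_bin, type = "response")
logit_pred <- as.integer(logit_prob > 0.5)
table(Actual = test_bin$HighCrime, Predicted = logit_pred)
mean(logit_pred == test_bin$HighCrime)
```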

# Evaluate models
# Calculate R-squared (how well model fits) and RMSE (error)
lm_r2 <- cor(lm_pred, test$ViolentCrimesPerPop)^2
rf_r2 <- cor(rf_pred, test$ViolentCrimesPerPop)^2

lm_rmse <- sqrt(mean((lm_pred - test$ViolentCrimesPerPop)^2))
rf_rmse <- sqrt(mean((rf_pred - test$ViolentCrimesPerPop)^2))

cat("Linear Regression: R2 =", lm_r2, " RMSE =", lm_rmse, "\n")
## Linear Regression: R2 = 0.6788732  RMSE = 0.1298727
cat("Random Forest: R2 =", rf_r2, " RMSE =", rf_rmse, "\n")
## Random Forest: R2 = 0.7044982  RMSE = 0.1251186
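
Item 7️⃣ from the improvement list (a model comparison plot) can reuse the metrics just computed; a minimal sketch:

```r
# Bar chart comparing the two models' test-set metrics
comparison <- data.frame(
  Model  = rep(c("Linear Regression", "Random Forest"), each = 2),
  Metric = rep(c("R-squared", "RMSE"), times = 2),
  Value  = c(lm_r2, lm_rmse, rf_r2, rf_rmse)
)
ggplot(comparison, aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(title = "Model Comparison on the Test Set", y = NULL)
```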
# Visualize predictions 
results <- data.frame(
  Actual = test$ViolentCrimesPerPop,
  Predicted = rf_pred
)

ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Random Forest Predictions vs Actual",
       x = "Actual Violent Crime Rate",
       y = "Predicted Crime Rate")

# Feature importance (which factors matter most) 
varImpPlot(rf_model, n.var = 10, main = "Most Important Features")

# Residual Analysis (Model Diagnostics)
residuals_rf <- test$ViolentCrimesPerPop - rf_pred

ggplot(data.frame(residuals_rf), aes(x = residuals_rf)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(title = "Residual Distribution (Random Forest)",
       x = "Residuals", y = "Count")
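
A residuals-vs-predicted plot complements the histogram by showing whether the errors grow or shrink with the predicted rate:

```r
# Residuals against predicted values; points should scatter evenly around zero
ggplot(data.frame(Predicted = rf_pred, Residual = residuals_rf),
       aes(x = Predicted, y = Residual)) +
  geom_point(alpha = 0.5, color = "darkorange") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Predicted (Random Forest)",
       x = "Predicted Crime Rate", y = "Residual")
```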

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Use the cleaned data for consistency with the earlier plots
p <- ggplot(data_clean, aes(x = PctPopUnderPov, y = ViolentCrimesPerPop)) +
  geom_point(alpha = 0.6, color = "purple") +
  labs(title = "Interactive: Poverty vs Violent Crime",
       x = "Poverty (%)", y = "Violent Crime Rate")


ggplotly(p)